In this project we carry out two case studies. In the first we use linear regression models to analyse air humidity; in the second we explore water potability by implementing some GLM (generalized linear) models.
In densely populated and polluted urban areas it is very useful to be able to predict the relative humidity of the air. Humidity affects the dispersion and dilution of air pollutants. In high-humidity conditions, pollutants are more likely to form aerosols or droplets, which are then removed from the air through processes such as rain or gravitational settling. Conversely, low humidity can keep fine particles suspended, contributing to poor air quality. Moreover, predicting air humidity is crucial for maintaining energy efficiency in buildings and for anticipating effects on the integrity of infrastructure such as bridges, buildings, and roads. In this case study we implement several linear regression techniques to address this challenge.
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005.
Note: Missing values in the observations have been tagged with the value -200.
Data source: https://archive.ics.uci.edu/dataset/360/air+quality
The data set AirQualityUCI.csv includes the following 15 variables:
Date: (String) Date of the recording (MM/DD/YYYY).
Time: (String) Time of the recording (HH:MM:SS).
CO.GT.: (Continuous) True hourly averaged CO concentration in mg/m^3.
PT08.S1.CO.: (Integer) Hourly averaged sensor response (nominally CO targeted).
NMHC.GT.: (Integer) True hourly averaged overall Non-Methanic HydroCarbons concentration in microg/m^3.
C6H6.GT.: (Continuous) True hourly averaged Benzene concentration in microg/m^3.
PT08.S2.NMHC.: (Integer) Hourly averaged sensor response (nominally NMHC targeted).
NOx.GT.: (Integer) True hourly averaged NOx concentration in ppb.
PT08.S3.NOx.: (Integer) Hourly averaged sensor response (nominally NOx targeted).
NO2.GT.: (Integer) True hourly averaged NO2 concentration in microg/m^3.
PT08.S4.NO2.: (Integer) Hourly averaged sensor response (nominally NO2 targeted).
PT08.S5.O3.: (Integer) Hourly averaged sensor response (nominally O3 targeted).
T: (Continuous) Temperature measured in Celsius degrees.
RH: (Continuous) Relative humidity in percentage.
AH: (Continuous) Absolute humidity.
The aim of this project is to predict the relative humidity based on the other air characteristics. We will not model the time dependency of the observations, but we will treat the hour and the month of each observation as potentially relevant features.
First we will load the data and display the data types of each feature.
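The loading step itself is not shown in the original chunk; a minimal sketch, assuming the file AirQualityUCI.csv sits in the working directory and is comma-separated (both assumptions):

```r
# Load the CSV file (path and separator are assumptions)
df_air <- read.csv("AirQualityUCI.csv", stringsAsFactors = FALSE)
# Display the structure: variable names, types, and first values
str(df_air)
```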
## 'data.frame': 9471 obs. of 17 variables:
## $ Date : chr "3/10/2004" "3/10/2004" "3/10/2004" "3/10/2004" ...
## $ Time : chr "18:00:00" "19:00:00" "20:00:00" "21:00:00" ...
## $ CO.GT. : num 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
## $ PT08.S1.CO. : int 1360 1292 1402 1376 1272 1197 1185 1136 1094 1010 ...
## $ NMHC.GT. : int 150 112 88 80 51 38 31 31 24 19 ...
## $ C6H6.GT. : num 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
## $ PT08.S2.NMHC.: int 1046 955 939 948 836 750 690 672 609 561 ...
## $ NOx.GT. : int 166 103 131 172 131 89 62 62 45 -200 ...
## $ PT08.S3.NOx. : int 1056 1174 1140 1092 1205 1337 1462 1453 1579 1705 ...
## $ NO2.GT. : int 113 92 114 122 116 96 77 76 60 -200 ...
## $ PT08.S4.NO2. : int 1692 1559 1555 1584 1490 1393 1333 1333 1276 1235 ...
## $ PT08.S5.O3. : int 1268 972 1074 1203 1110 949 733 730 620 501 ...
## $ T : num 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
## $ RH : num 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
## $ AH : num 0.758 0.726 0.75 0.787 0.789 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
Cleaning the data set
# Count NaN values in each column
nan_counts <- colSums(is.na(df_air))
# Print the result
print(nan_counts)
## Date Time CO.GT. PT08.S1.CO. NMHC.GT.
## 0 0 114 114 114
## C6H6.GT. PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT.
## 114 114 114 114 114
## PT08.S4.NO2. PT08.S5.O3. T RH AH
## 114 114 114 114 114
## X X.1
## 9471 9471
Let’s delete the instances and the columns with null values:
# Delete the last two columns
df_air <- df_air[, -c((ncol(df_air) - 1):ncol(df_air))]
# Drop rows containing NaN values (here, the trailing rows that are entirely empty)
df_air <- na.omit(df_air)
# Drop rows with NaN values in the 'RH' column
df_air <- df_air[complete.cases(df_air$RH), ]
# Count NaN values in each column after cleaning
nan_counts <- colSums(is.na(df_air))
# Print the result after deleting NaN values
print(nan_counts)
## Date Time CO.GT. PT08.S1.CO. NMHC.GT.
## 0 0 0 0 0
## C6H6.GT. PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT.
## 0 0 0 0 0
## PT08.S4.NO2. PT08.S5.O3. T RH AH
## 0 0 0 0 0
Creating Hour and Month columns
In order to use the hour and the month as features, we will convert the Date and Time columns, extract the corresponding values, and add them to the data set as two new columns.
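The str() output below already shows Date stored as class Date; a sketch of that conversion in base R (the original code may use lubridate::mdy instead, since month(), hms() and hour() from lubridate are used afterwards):

```r
# Parse the MM/DD/YYYY strings into Date objects (base R version)
df_air$Date <- as.Date(df_air$Date, format = "%m/%d/%Y")
```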
## 'data.frame': 9357 obs. of 15 variables:
## $ Date : Date, format: "2004-03-10" "2004-03-10" ...
## $ Time : chr "18:00:00" "19:00:00" "20:00:00" "21:00:00" ...
## $ CO.GT. : num 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
## $ PT08.S1.CO. : int 1360 1292 1402 1376 1272 1197 1185 1136 1094 1010 ...
## $ NMHC.GT. : int 150 112 88 80 51 38 31 31 24 19 ...
## $ C6H6.GT. : num 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
## $ PT08.S2.NMHC.: int 1046 955 939 948 836 750 690 672 609 561 ...
## $ NOx.GT. : int 166 103 131 172 131 89 62 62 45 -200 ...
## $ PT08.S3.NOx. : int 1056 1174 1140 1092 1205 1337 1462 1453 1579 1705 ...
## $ NO2.GT. : int 113 92 114 122 116 96 77 76 60 -200 ...
## $ PT08.S4.NO2. : int 1692 1559 1555 1584 1490 1393 1333 1333 1276 1235 ...
## $ PT08.S5.O3. : int 1268 972 1074 1203 1110 949 733 730 620 501 ...
## $ T : num 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
## $ RH : num 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
## $ AH : num 0.758 0.726 0.75 0.787 0.789 ...
## - attr(*, "na.action")= 'omit' Named int [1:114] 9358 9359 9360 9361 9362 9363 9364 9365 9366 9367 ...
## ..- attr(*, "names")= chr [1:114] "9358" "9359" "9360" "9361" ...
# Create a new feature with only the month
df_air$Month <- month(df_air$Date)
# Convert the time string to an hms object
df_air$Time <- hms(df_air$Time)
# Create a new feature with only the hour
df_air$Hour <- hour(df_air$Time)
Treating missing values
Missing values have been tagged with the value -200, so we will replace the entries tagged as -200 in each feature with a median value.
missing_values_targeted <- apply(df_air, 2, function(x) sum(x == -200))
print(missing_values_targeted)
## Date Time CO.GT. PT08.S1.CO. NMHC.GT.
## 0 0 0 366 8443
## C6H6.GT. PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT.
## 0 366 1639 366 1642
## PT08.S4.NO2. PT08.S5.O3. T RH AH
## 366 366 0 0 0
## Month Hour
## 0 0
Since the feature NMHC.GT. has 8443 missing values, we will not consider this feature. The instances with relative humidity (RH) equal to -200 will be removed from the data set.
#Remove the column NMHC.GT. from the dataset
df_air <- df_air[, !(names(df_air) %in% "NMHC.GT.")]
# Replace values equal to -200 with NA in the 'RH' column
df_air$RH[df_air$RH == -200] <- NA
# Remove rows where 'RH' is NA
df_air <- df_air[!is.na(df_air$RH), ]
To achieve a more accurate data imputation, we take into account the values of each feature grouped by hour: we replace the values tagged as -200 in each column with that column's median for the corresponding hour. We do this for every column that contains -200 values.
# For each affected column: replace values equal to -200 with NA, then impute
# each NA with the median of that column for the corresponding hour
cols_with_missing <- c("PT08.S1.CO.", "PT08.S2.NMHC.", "NOx.GT.",
"PT08.S3.NOx.", "NO2.GT.", "PT08.S4.NO2.", "PT08.S5.O3.")
for (col in cols_with_missing) {
df_air[[col]][df_air[[col]] == -200] <- NA
df_air[[col]] <- ave(df_air[[col]], df_air$Hour,
FUN = function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))
}
We will split our data into train and test sets. Since we are not considering the temporal dependency of the observations, we shuffle the data before splitting it so that both sets are representative.
# Set a seed for reproducibility
set.seed(123)
# Use sample to create a train/test split with shuffling
index <- sample(1:nrow(df_air), 0.8 * nrow(df_air))
# Create training and testing sets
df_air_train <- df_air[index, ]
df_air_test <- df_air[-index, ]
Even though we do not assume time dependency, it is interesting to visualize RH on a timeline for exploratory purposes.
# Create a line plot of 'RH' over time
ggplot(df_air_train, aes(x = Date, y = RH)) +
geom_line() +
labs(title = "Relative Humidity Over Time",
x = "Time",
y = "Relative Humidity")
We observe variability in the observations, and the time of year could have an influence on the relative humidity.
For our model we will use the features Month and Hour rather than the Date and Time variables, so we will delete the first two columns.
# Eliminate the first two columns
df_air_train <- df_air_train[, -c(1, 2)]
df_air_test <- df_air_test[, -c(1, 2)]
Let’s show the distribution of the relative humidity.
# Create a histogram with relative frequencies of the 'RH' column
hist(df_air_train$RH, main = "Relative Humidity Histogram", xlab = "Relative Humidity", col = "skyblue", border = "black", freq = FALSE)
The distribution is roughly symmetric, so we will not transform the response variable for the predictions.
We will calculate the correlation matrix to study the linear dependencies of the variables.
# Calculate the correlation matrix
cor_matrix <- cor(df_air_train)
# Set the overall size of the plot (adjust as needed)
par(mfrow = c(1, 1), mar = c(0.5, 0.5, 0.5, 0.5),oma = c(1,1, 1, 1))
# Create a correlation heatmap with corrplot
corrplot(cor_matrix, method = "color", col = colorRampPalette(c("blue", "white", "red"))(100),
addCoef.col = "black", tl.col = "black", tl.cex = 0.7, number.cex = 0.6, tl.srt = 45)
corr_RH <- sort(cor(df_air_train)["RH", ], decreasing = TRUE)
corr_data <- data.frame(variable = names(corr_RH), correlation = corr_RH)
ggplot(corr_data, aes(x = variable, y = correlation)) +
geom_bar(stat = "identity", fill = "lightgreen") +
scale_x_discrete(limits = corr_data$variable) +
labs(x = "", y = "Correlation with RH", title = "Correlations with RH") +
theme(plot.title = element_text(hjust = 0, size = rel(1.5)),
axis.text.x = element_text(angle = 45, hjust = 1))
The feature most correlated with RH is temperature (T), with a negative correlation of -0.52: based on the data provided, as the temperature increases, the relative humidity tends to decrease in an approximately linear fashion.
In the following sections we will implement some linear regression models.
Since the highest correlation with the relative humidity (RH) corresponds to the temperature (T), we select this attribute for our simple linear regression model. As we saw above, the relative humidity has a symmetric distribution, so we will not transform the variable RH.
# Fit the linear regression model
linFit <- lm(RH ~ T, data = df_air_train)
# Display the summary of the linear regression model
summary(linFit)
##
## Call:
## lm(formula = RH ~ T, data = df_air_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.173 -9.987 0.350 10.329 38.927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.06670 0.37823 185.25 <2e-16 ***
## T -1.13509 0.01862 -60.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.02 on 7190 degrees of freedom
## Multiple R-squared: 0.3408, Adjusted R-squared: 0.3407
## F-statistic: 3717 on 1 and 7190 DF, p-value: < 2.2e-16
This summary includes essential information such as the coefficients, standard errors, t-values, and p-values for each predictor variable, as well as overall statistics like the R-squared value and adjusted R-squared value. An R-squared value of 0.34 means that approximately 34% of the variability in the dependent variable (Relative Humidity, RH) can be explained by the linear relationship with the predictor variable Temperature (T).
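The reported R-squared can be reproduced directly from its definition, R² = 1 − SS_res/SS_tot; a sketch using the fitted model above:

```r
# Residual sum of squares of the fitted model
ss_res <- sum(residuals(linFit)^2)
# Total sum of squares of the response around its mean
ss_tot <- sum((df_air_train$RH - mean(df_air_train$RH))^2)
# Proportion of variance explained; should match the Multiple R-squared above
1 - ss_res / ss_tot
```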
Let’s evaluate how this linear model performs on the test set.
simplepred <- predict(linFit, newdata=df_air_test, interval = "prediction")
cor(df_air_test$RH, simplepred)^2
## fit lwr upr
## [1,] 0.3110872 0.3111422 0.3110321
A value of 0.3111 indicates that approximately 31.11% of the variability in the observed relative humidity is explained by the linear relationship with temperature on the test set. The lwr and upr columns are simply the squared correlations of the observed RH with the lower and upper prediction bounds; since those bounds track the fitted values closely, the three numbers are nearly identical.
# Create a scatter plot of the test data with the prediction intervals
ggplot(df_air_test, aes(x = T, y = RH)) +
geom_point(color = "red") + # Observed test points
geom_line(aes(y = simplepred[, "fit"]), color = "green") + # Prediction line
geom_ribbon(aes(ymin = simplepred[, "lwr"], ymax = simplepred[, "upr"]), fill = "green", alpha = 0.2) + # Prediction interval ribbon
labs(title = "Prediction Intervals", y = "RH") + theme_minimal()
This graph provides a visual understanding of how well the linear regression model captures the relationship between temperature and relative humidity. The red points allow a direct comparison between observed and predicted RH values, while the green line and ribbon show the fitted trend and the uncertainty associated with individual predictions.
Let’s see the percentage of points covered by the prediction intervals.
# Extracting prediction intervals
lower_bound_s <- simplepred[, "lwr"]
upper_bound_s <- simplepred[, "upr"]
# Adding prediction intervals to the test data
df_air_test$Lower_s <- lower_bound_s
df_air_test$Upper_s <- upper_bound_s
# Counting the points outside the intervals
outside_interval_count_s <- sum(df_air_test$RH < df_air_test$Lower_s | df_air_test$RH > df_air_test$Upper_s)
# Calculating the coverage
total_points_s <- nrow(df_air_test)
coverage_s <- round(100 - (outside_interval_count_s / total_points_s) * 100, digits = 1)
# Printing the coverage
print(paste("Percentage of points inside the intervals:", coverage_s, "%"))
## [1] "Percentage of points inside the intervals: 95.4 %"
We will now implement a multiple linear regression model, aiming for a higher R-squared value and ultimately a better predictive model. Since absolute humidity (AH) is strongly correlated with the temperature (correlation 0.66), we have decided to include this variable in our model. Testing other candidate multiple regression models with different feature sets showed that the model presented here performs best.
# Fit the multiple linear regression model
multiFit <- lm(RH ~ T + AH, data = df_air_train)
# Display the summary of the linear regression model
summary(multiFit)
##
## Call:
## lm(formula = RH ~ T + AH, data = df_air_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.4266 -3.6134 -0.5059 3.8116 26.9026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.56777 0.21168 238.9 <2e-16 ***
## T -2.36289 0.01143 -206.7 <2e-16 ***
## AH 40.92467 0.25116 162.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.474 on 7189 degrees of freedom
## Multiple R-squared: 0.8595, Adjusted R-squared: 0.8595
## F-statistic: 2.2e+04 on 2 and 7189 DF, p-value: < 2.2e-16
We observe that the R-squared value for this model is 0.8595, meaning that approximately 85.9% of the variability in the observed relative humidity is explained by the linear relationship between RH and the pair temperature and absolute humidity. Simply by adding one more variable to our model we obtain a great improvement.
multiPred <- predict(multiFit, newdata = df_air_test, interval = "prediction")
# Create a scatter plot with the prediction intervals for the multiple regression model
ggplot(df_air_test, aes(x = T, y = RH)) +
geom_point(color = "red") + # Observed test points
geom_ribbon(aes(ymin = multiPred[, "lwr"], ymax = multiPred[, "upr"]), fill = "green", alpha = 0.2) + # Prediction interval ribbon
labs(title = "Prediction Intervals (Multiple Regression)", y = "RH") + theme_minimal()
# Out-of-sample R-squared against the predictions
cor(df_air_test$RH, multiPred)^2
## fit lwr upr
## [1,] 0.8719585 0.8719607 0.8719563
After making predictions on the test set, we observe that the out-of-sample R-squared reaches 87.1%.
We will now see the percentage of points covered by the intervals.
# Extracting prediction intervals
lower_bound <- multiPred[, "lwr"]
upper_bound <- multiPred[, "upr"]
# Adding prediction intervals to the test data
df_air_test$Lower <- lower_bound
df_air_test$Upper <- upper_bound
# Counting the points outside the intervals
outside_interval_count <- sum(df_air_test$RH < df_air_test$Lower | df_air_test$RH > df_air_test$Upper)
# Calculating the coverage
total_points <- nrow(df_air_test)
coverage <- round(100 - (outside_interval_count / total_points) * 100, digits = 1)
# Printing the coverage
print(paste("Percentage of points inside the intervals:", coverage, "%"))
## [1] "Percentage of points inside the intervals: 94.5 %"
To evaluate the models we have obtained, we consider a benchmark model whose prediction for the relative humidity is simply the mean relative humidity of the training set.
mean_RH <- mean(df_air_train$RH)
# Create a benchmark model with constant mean predictions
benchmark_predictions <- rep(mean_RH, nrow(df_air_test))
# Evaluate the benchmark model
# Calculate R-squared for the benchmark model
benchmark_r_squared <- 1 - sum((df_air_test$RH - benchmark_predictions)^2) / sum((df_air_test$RH - mean(df_air_test$RH))^2)
cat('Benchmark R-squared:', benchmark_r_squared, '\n')
## Benchmark R-squared: -0.0007042575
# Calculate the Mean Square Error for the benchmark model
benchmark_mse <- mean((df_air_test$RH - benchmark_predictions)^2)
cat('Benchmark MSE:', benchmark_mse, '\n')
## Benchmark MSE: 306.0251
We observe that the benchmark performs worse than both our multiple regression model and our simple linear regression model. Our final model is therefore the multiple linear model, with an R-squared of 87.1%.
The ability to predict water potability is crucial for several reasons, such as public health and environmental conservation. In this case study we will be able to appreciate the complexity of this issue and identify some predictive modelling techniques to assess it.
The water_potability.csv file contains water quality metrics for 3276 different water bodies. The dataset contains 9 continuous variables and one binary variable, which is the prediction target:
ph: pH of the water (ranges from 0 to 14).
Hardness: Capacity of water to precipitate soap in mg/L.
Solids: Total dissolved solids in ppm.
Chloramines: Amount of Chloramines in ppm.
Sulfate: Amount of Sulfates dissolved in mg/L.
Conductivity: Electrical conductivity of water in μS/cm.
Organic_carbon: Amount of organic carbon in ppm.
Trihalomethanes: Amount of Trihalomethanes in μg/L.
Turbidity: Measure of the light-emitting property of water in NTU.
Potability: Indicates whether the water is safe for human consumption (1 = potable, 0 = not potable).
The goal of this project is to predict water potability based on water quality metrics.
First we will load the dataset.
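As in the first case study, the loading code is not shown; a minimal sketch, assuming water_potability.csv is in the working directory:

```r
# Load the CSV file (path is an assumption)
df_water <- read.csv("water_potability.csv", stringsAsFactors = FALSE)
# Display the structure of the data frame
str(df_water)
```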
## 'data.frame': 3276 obs. of 10 variables:
## $ ph : num NA 3.72 8.1 8.32 9.09 ...
## $ Hardness : num 205 129 224 214 181 ...
## $ Solids : num 20791 18630 19910 22018 17979 ...
## $ Chloramines : num 7.3 6.64 9.28 8.06 6.55 ...
## $ Sulfate : num 369 NA NA 357 310 ...
## $ Conductivity : num 564 593 419 363 398 ...
## $ Organic_carbon : num 10.4 15.2 16.9 18.4 11.6 ...
## $ Trihalomethanes: num 87 56.3 66.4 100.3 32 ...
## $ Turbidity : num 2.96 4.5 3.06 4.63 4.08 ...
## $ Potability : int 0 0 0 0 0 0 0 0 0 0 ...
Study of the NULL values:
Only the features ph, Sulfate and Trihalomethanes contain null values. We also show the missing-data rates with respect to the potability target.
# Count NaN values in each column
nan_counts <- colSums(is.na(df_water))
#Identify the variables with missing values
variables_with_missing <- colnames(df_water)[colSums(is.na(df_water)) > 0]
# Print the result
print(nan_counts)
## ph Hardness Solids Chloramines Sulfate
## 491 0 0 0 781
## Conductivity Organic_carbon Trihalomethanes Turbidity Potability
## 0 0 162 0 0
target <- df_water$Potability
# Calculate the missing data rate for each value of the target variable and each variable with missing values
for (variable in variables_with_missing) {
missing_rate_potable_0 <- sum(is.na(df_water[df_water$Potability == 0, variable])) / sum(df_water$Potability == 0)
missing_rate_potable_1 <- sum(is.na(df_water[df_water$Potability == 1, variable])) / sum(df_water$Potability == 1)
cat("Missing data rate for", variable, "and Potability = 0:", missing_rate_potable_0, "\n")
cat("Missing data rate for", variable, "and Potability = 1:", missing_rate_potable_1, "\n")
}
## Missing data rate for ph and Potability = 0: 0.1571572
## Missing data rate for ph and Potability = 1: 0.1384977
## Missing data rate for Sulfate and Potability = 0: 0.2442442
## Missing data rate for Sulfate and Potability = 1: 0.2292645
## Missing data rate for Trihalomethanes and Potability = 0: 0.05355355
## Missing data rate for Trihalomethanes and Potability = 1: 0.04303599
To achieve a more accurate data imputation, we will fill all null values with the median of the corresponding column, grouping by the potability target.
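The imputation code itself is not shown; a sketch of the described approach using base R's ave(), grouping by the Potability target (variables_with_missing was defined in the chunk above):

```r
# For each feature with missing values, replace each NA with the median
# of that feature within the same Potability class
for (variable in variables_with_missing) {
  df_water[[variable]] <- ave(df_water[[variable]], df_water$Potability,
                              FUN = function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))
}
```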
We will split our data set into training and testing sets as we did in the previous case study.
# Set a seed for reproducibility
set.seed(123)
# Use sample to create a train/test split with shuffling
index <- sample(1:nrow(df_water), 0.8 * nrow(df_water))
# Create training and testing sets
df_water_train <- df_water[index, ]
df_water_test <- df_water[-index, ]
##
## 0 1
## 61.29771 38.70229
61.3% of the collected observations are labelled as not potable. This imbalance could be addressed by several methods, such as oversampling the smaller class or undersampling the bigger class. Since the difference in proportions is not very significant, we will not modify the data set. Instead, we will use as our benchmark a model that always predicts the non-potable outcome.
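The class percentages printed above come from the target distribution in the training set; a sketch (the original call is not shown):

```r
# Percentage of each Potability class in the training set
prop.table(table(df_water_train$Potability)) * 100
```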
We will now study the correlation between the variables.
# Calculate the correlation matrix
cor_matrix <- cor(df_water_train)
# Set the overall size of the plot (adjust as needed)
par(mfrow = c(1, 1), mar = c(0.5, 0.5, 0.5, 0.5),oma = c(1,1, 1, 1))
# Create a correlation heatmap with corrplot
corrplot(cor_matrix, method = "color", col = colorRampPalette(c("blue", "white", "red"))(100),
addCoef.col = "black", tl.col = "black", tl.cex = 0.7, number.cex = 0.6, tl.srt = 45)
In contrast to the previous data set, this data shows very little correlation between the variables.
If we plot the density of each of the features by the factor potability, we see that the distribution of the variables behaves very similarly for the two classes. This may be indicating that predicting the potability accurately might be a challenge.
df_water_train_long <- gather(df_water_train, key = "variable", value = "value", -Potability)
ggplot(df_water_train_long, aes(x = value, fill = factor(Potability))) +
geom_density(alpha = 0.5) +
facet_wrap(~ variable, scales = "free")
It is crucial to carefully choose the evaluation metric for our models. We must weigh the consequences of predicting potability when the actual instance is non-potable against predicting non-potability when the actual instance is potable. We will analyze both the accuracy and the precision of the models and try to achieve some balance between the two metrics.
As said before, since 61.3% of the training data is labelled as non-potable water, our benchmark model will simply predict that all instances are non-potable.
# Create a vector of predictions where all instances are predicted as non-potable (class 0)
predictions_benchmark <- rep(0, nrow(df_water_test))
# Create a confusion matrix for the benchmark model
conf_matrix_benchmark <- table(Actual = df_water_test$Potability, Predicted = predictions_benchmark)
# Display the confusion matrix
print(conf_matrix_benchmark)
## Predicted
## Actual 0
## 0 392
## 1 264
# Calculate accuracy for the benchmark model
accuracy_benchmark <- sum(diag(conf_matrix_benchmark)) / sum(conf_matrix_benchmark)
# Precision for class 0 (non-potable): trivially 1 here, since every
# instance is predicted as non-potable
precision_benchmark <- conf_matrix_benchmark[1, 1] / sum(conf_matrix_benchmark[1,])
# Display the results
cat("Accuracy of the benchmark model:", accuracy_benchmark, "\n")
cat("Precision of the benchmark model:", precision_benchmark, "\n")
## Accuracy of the benchmark model: 0.597561
## Precision of the benchmark model: 1
The benchmark has low accuracy but a nominal precision of 100%. This seemingly perfect precision is deceptive: it is achieved trivially by never predicting the potable class, so the model provides no meaningful insight. A more sophisticated model should strike a balance between accuracy and precision. We will implement some generalized linear models that attempt to do this.
First GLM model
For our first attempt, full_model includes all available features in the training dataset. We then apply stepwise variable selection based on the Akaike Information Criterion (AIC) to obtain a more concise model, stored in selected_model. A weighted approach is applied to address the class imbalance, assigning a higher weight (1.3) to the instances where the water is potable (class 1); without these weights the model would predict all instances as non-potable.
# Model fitting using glm
full_model <- glm(Potability ~ ., data = df_water_train, family = binomial,weights = ifelse(df_water_train$Potability == 1, 1.3, 1))
#Display the full model
summary(full_model)
##
## Call:
## glm(formula = Potability ~ ., family = binomial, data = df_water_train,
## weights = ifelse(df_water_train$Potability == 1, 1.3, 1))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.876e-01 6.340e-01 0.296 0.7673
## ph -3.616e-03 2.542e-02 -0.142 0.8869
## Hardness -4.542e-04 1.128e-03 -0.403 0.6871
## Solids 7.523e-06 4.303e-06 1.748 0.0804 .
## Chloramines 3.548e-02 2.353e-02 1.508 0.1316
## Sulfate -1.242e-03 1.038e-03 -1.197 0.2313
## Conductivity -2.831e-04 4.666e-04 -0.607 0.5440
## Organic_carbon -2.196e-02 1.131e-02 -1.942 0.0521 .
## Trihalomethanes 1.586e-03 2.377e-03 0.667 0.5047
## Turbidity 1.375e-02 4.736e-02 0.290 0.7716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4025.4 on 2619 degrees of freedom
## Residual deviance: 4013.2 on 2610 degrees of freedom
## AIC: 3549.9
##
## Number of Fisher Scoring iterations: 4
# Stepwise selection by AIC; the trace below is produced by this call
selected_model <- step(full_model)
## Start: AIC=3549.87
## Potability ~ ph + Hardness + Solids + Chloramines + Sulfate +
## Conductivity + Organic_carbon + Trihalomethanes + Turbidity
##
## Df Deviance AIC
## - ph 1 4013.2 3547.9
## - Turbidity 1 4013.3 3548.0
## - Hardness 1 4013.3 3548.0
## - Conductivity 1 4013.5 3548.2
## - Trihalomethanes 1 4013.6 3548.3
## - Sulfate 1 4014.6 3549.3
## <none> 4013.2 3549.9
## - Chloramines 1 4015.4 3550.1
## - Solids 1 4016.2 3550.9
## - Organic_carbon 1 4016.9 3551.6
##
## Step: AIC=3547.86
## Potability ~ Hardness + Solids + Chloramines + Sulfate + Conductivity +
## Organic_carbon + Trihalomethanes + Turbidity
##
## Df Deviance AIC
## - Turbidity 1 4013.3 3545.9
## - Hardness 1 4013.4 3546.0
## - Conductivity 1 4013.6 3546.2
## - Trihalomethanes 1 4013.6 3546.3
## - Sulfate 1 4014.6 3547.3
## <none> 4013.2 3547.9
## - Chloramines 1 4015.5 3548.1
## - Solids 1 4016.3 3548.9
## - Organic_carbon 1 4017.0 3549.7
##
## Step: AIC=3545.93
## Potability ~ Hardness + Solids + Chloramines + Sulfate + Conductivity +
## Organic_carbon + Trihalomethanes
##
## Df Deviance AIC
## - Hardness 1 4013.4 3544.1
## - Conductivity 1 4013.6 3544.3
## - Trihalomethanes 1 4013.7 3544.4
## - Sulfate 1 4014.7 3545.4
## <none> 4013.3 3545.9
## - Chloramines 1 4015.5 3546.2
## - Solids 1 4016.4 3547.0
## - Organic_carbon 1 4017.1 3547.8
##
## Step: AIC=3544.1
## Potability ~ Solids + Chloramines + Sulfate + Conductivity +
## Organic_carbon + Trihalomethanes
##
## Df Deviance AIC
## - Conductivity 1 4013.8 3542.5
## - Trihalomethanes 1 4013.9 3542.5
## - Sulfate 1 4014.8 3543.4
## <none> 4013.4 3544.1
## - Chloramines 1 4015.8 3544.4
## - Solids 1 4016.7 3545.3
## - Organic_carbon 1 4017.3 3546.0
##
## Step: AIC=3542.4
## Potability ~ Solids + Chloramines + Sulfate + Organic_carbon +
## Trihalomethanes
##
## Df Deviance AIC
## - Trihalomethanes 1 4014.2 3540.8
## - Sulfate 1 4015.1 3541.7
## <none> 4013.8 3542.4
## - Chloramines 1 4016.1 3542.7
## - Solids 1 4017.0 3543.6
## - Organic_carbon 1 4017.7 3544.3
##
## Step: AIC=3540.78
## Potability ~ Solids + Chloramines + Sulfate + Organic_carbon
##
## Df Deviance AIC
## - Sulfate 1 4015.6 3540.2
## <none> 4014.2 3540.8
## - Chloramines 1 4016.6 3541.2
## - Solids 1 4017.4 3541.9
## - Organic_carbon 1 4018.2 3542.7
##
## Step: AIC=3540.09
## Potability ~ Solids + Chloramines + Organic_carbon
##
## Df Deviance AIC
## <none> 4015.6 3540.1
## - Chloramines 1 4017.9 3540.4
## - Solids 1 4019.5 3542.0
## - Organic_carbon 1 4019.7 3542.1
##
## Call:
## glm(formula = Potability ~ Solids + Chloramines + Organic_carbon,
## family = binomial, data = df_water_train, weights = ifelse(df_water_train$Potability ==
## 1, 1.3, 1))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.116e-01 2.573e-01 -1.211 0.2259
## Solids 8.358e-06 4.230e-06 1.976 0.0482 *
## Chloramines 3.570e-02 2.349e-02 1.520 0.1286
## Organic_carbon -2.268e-02 1.128e-02 -2.011 0.0444 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4025.4 on 2619 degrees of freedom
## Residual deviance: 4015.6 on 2616 degrees of freedom
## AIC: 3540.1
##
## Number of Fisher Scoring iterations: 4
# Model predictions on the test set
predictions <- predict(selected_model, newdata = df_water_test, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Create a confusion matrix
conf_matrix <- table(Actual = df_water_test$Potability, Predicted = predicted_classes)
# If the model predicted only one class, add a zero column for the missing class
if (ncol(conf_matrix) == 1) {
missing_class <- setdiff(c("0", "1"), colnames(conf_matrix))
conf_matrix <- cbind(conf_matrix, 0)
colnames(conf_matrix)[2] <- missing_class
}
# Display the confusion matrix
print(conf_matrix)
## Predicted
## Actual 0 1
## 0 374 18
## 1 245 19
The formula selected by the step function is Potability ~ Solids + Chloramines + Organic_carbon.
Evaluation of the model:
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy of the model with selected features: ", accuracy, "\n")
## Accuracy of the model with selected features: 0.5990854
# Calculate precision
precision <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
cat("Precision of the model with selected features: ", precision, "\n")
## Precision of the model with selected features: 0.5135135
These accuracy and precision values are worse than our benchmark, so we will consider interactions between variables in order to achieve some improvement.
Second GLM model
After trying several sets of features and several variable interactions, we have selected the following model, which provides an improvement and some balance between precision and accuracy.
# Better model fitting using glm
model <- glm(Potability ~ Sulfate*Solids + ph*Sulfate, data = df_water_train, family = binomial)
# Model summary
summary(model)
##
## Call:
## glm(formula = Potability ~ Sulfate * Solids + ph * Sulfate, family = binomial,
## data = df_water_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.495e+01 2.550e+00 -9.783 < 2e-16 ***
## Sulfate 7.334e-02 7.620e-03 9.626 < 2e-16 ***
## Solids 3.218e-04 4.475e-05 7.191 6.43e-13 ***
## ph 2.530e+00 3.175e-01 7.969 1.60e-15 ***
## Sulfate:Solids -9.554e-07 1.344e-07 -7.106 1.20e-12 ***
## Sulfate:ph -7.626e-03 9.506e-04 -8.022 1.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3497.2 on 2619 degrees of freedom
## Residual deviance: 3352.8 on 2614 degrees of freedom
## AIC: 3364.8
##
## Number of Fisher Scoring iterations: 4
# Model predictions on the test set
predictions <- predict(model, newdata = df_water_test, type = "response")
# Convert predicted probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Create a confusion matrix
conf_matrix <- table(Actual = df_water_test$Potability, Predicted = predicted_classes)
# Display the confusion matrix
print(conf_matrix)
## Predicted
## Actual 0 1
## 0 373 19
## 1 214 50
Plots of the variable interactions:
# Create an effect object for the interaction
interaction_effect <- effect("Sulfate*Solids", model, data = df_water_train)
# Plot the interaction
plot(interaction_effect, multiline = FALSE, rug = FALSE, ci.style = "band", rescale.axis = FALSE)
As the graph shows, for instances with high values of Solids the probability of potability decreases as Sulfate increases.
# Create an effect object for the interaction
interaction_effect <- effect("Sulfate*ph", model, data = df_water_train)
# Plot the interaction
plot(interaction_effect, multiline = TRUE, rug = FALSE, ci.style = "band", rescale.axis = FALSE)
Here we see that for low pH the probability of potability increases as Sulfate increases, whereas for high pH values it decreases as Sulfate increases.
Evaluation of the model:
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy: ", accuracy, "\n")
## Accuracy: 0.6448171
# Calculate precision
precision <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
cat("Precision: ", precision, "\n")
## Precision: 0.7246377
This model provides a reasonable balance between accuracy and precision, given the difficulty of the problem.
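Accuracy and precision alone can hide a low detection rate for the positive class; recall and the F1 score complete the picture. This was not computed in the original analysis — a quick sketch reusing the conf_matrix and precision objects from above (with the confusion matrix shown, it gives recall of roughly 0.19 and F1 of roughly 0.30):

```r
# Recall and F1 from the confusion matrix above
# (rows = actual classes, columns = predicted classes)
recall <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
f1 <- 2 * precision * recall / (precision + recall)
cat("Recall: ", recall, "\n")
cat("F1 score: ", f1, "\n")
```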
Even though the accuracy metrics obtained are low, these models are still useful for studying the interactions and relationships among the variables of the problem.
We will now calculate confidence intervals for the betas. The output below shows the lower and upper bounds of the 95% confidence interval for each coefficient in our logistic regression model.
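The code chunk that produced this table is not shown in the render; a call like the following (a sketch — the original chunk may differ) produces profile-likelihood intervals for a fitted glm:

```r
# 95% profile-likelihood confidence intervals for the coefficients
# (assumes `model` is the fitted glm object from above)
confint(model, level = 0.95)
```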
## 2.5 % 97.5 %
## (Intercept) -3.006870e+01 -2.006850e+01
## Sulfate 5.877307e-02 8.865512e-02
## Solids 2.356868e-04 4.112387e-04
## ph 1.921727e+00 3.166889e+00
## Sulfate:Solids -1.224164e-06 -6.967226e-07
## Sulfate:ph -9.534508e-03 -5.806089e-03
The intervals obtained are in the log-odds scale. If we are interested in odds ratios, we’ll need to exponentiate the coefficients and their confidence intervals.
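Again the chunk itself is not shown; a one-liner such as the following (a sketch, assuming the fitted `model` from above) yields the table below:

```r
# Odds ratios: exponentiate the confidence-interval bounds
exp(confint(model, level = 0.95))
```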
## 2.5 % 97.5 %
## (Intercept) 8.736369e-14 1.924689e-09
## Sulfate 1.060535e+00 1.092704e+00
## Solids 1.000236e+00 1.000411e+00
## ph 6.832748e+00 2.373353e+01
## Sulfate:Solids 9.999988e-01 9.999993e-01
## Sulfate:ph 9.905108e-01 9.942107e-01
This provides us with the odds ratios and their confidence intervals, making the results more interpretable.
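Because of the interaction terms, the Sulfate odds ratio is not a single number: it depends on the values of Solids and ph at which it is evaluated. A sketch of how to compute it at chosen reference values (the reference values below are illustrative assumptions, not from the original analysis):

```r
# Effective odds ratio for a one-unit increase in Sulfate at fixed
# reference values of Solids and ph (illustrative values, assumed)
b <- coef(model)
solids_ref <- 20000
ph_ref <- 7
or_sulfate <- exp(b["Sulfate"] + b["Sulfate:Solids"] * solids_ref + b["Sulfate:ph"] * ph_ref)
cat("Sulfate odds ratio at Solids =", solids_ref, "and ph =", ph_ref, ":", or_sulfate, "\n")
```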
In this project we have observed that, depending on the problem under study, we may face different challenges: the achievable prediction metrics, as well as their expected values, vary with the nature of the problem. In our first case study, we achieved an R-squared value of 87.1% when predicting relative air humidity. That problem exhibited clear linear dependencies between variables, allowing us to implement a multivariable linear model with accurate predictions. Our second case study presented a different scenario: the data were considerably noisier, and the relationship between water potability and the water characteristics proved difficult to interpret. The interactions between variables were crucial for extracting information from the model. In the future, it would be interesting to apply other techniques to these problems, such as decision trees or support vector machines, which might give us more insight and enable more accurate predictions.